Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.